Add .tar.gz / .tgz archive support for extension, preset, and workflow installation by Copilot · Pull Request #2395 · github/spec-kit

Copilot · 2026-04-28T17:49:12Z

The extension, preset, and workflow download pipelines only accepted ZIP archives, blocking use of npm registries and CI artifact stores that serve tarballs natively.

Core utilities (`extensions.py`)

detect_archive_format(url, content_type="") — infers format from URL path extension (.zip, .tar.gz, .tgz) with Content-Type header fallback (application/gzip, application/x-gzip, application/x-tar+gzip)
safe_extract_tarball(archive_path, dest_dir, error_class) — safe extraction with:
- Pre-extraction validation of all members: rejects absolute paths, .. traversal, symlinks, hard links, devices, and FIFOs
- PAX headers (XHDTYPE, XGLTYPE, SOLARIS_XHDTYPE) and GNU metadata-only entries (GNUTYPE_LONGNAME, GNUTYPE_LONGLINK) are silently skipped — they carry no extractable payload and are emitted by many common archiving tools; GNUTYPE_SPARSE is intentionally not skipped because sparse entries carry a real file payload and isreg() returns True for them
- All Python versions: passes pre-validated safe_members list to extractall() to ensure only vetted entries are extracted
- Python 3.12+: additionally uses tarfile.data_filter for extra OS-level protection, combined with members=safe_members
- tarfile.TarError/OSError are caught and re-raised as the caller-supplied error_class for consistent error handling
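
The detection rule described for `detect_archive_format` can be sketched as follows. This is a minimal illustration rather than the PR's actual code: the return values (`"zip"`, `"tar.gz"`, and an empty string for unknown) are assumptions, chosen so that the `if not archive_fmt` checks seen in the review hunks would work, and only the Content-Type values listed above are recognized.

```python
from urllib.parse import urlparse

# Content types the PR description lists as gzip-tarball hints.
_TARBALL_CONTENT_TYPES = {
    "application/gzip",
    "application/x-gzip",
    "application/x-tar+gzip",
}


def detect_archive_format(url: str, content_type: str = "") -> str:
    """Return "zip", "tar.gz", or "" when the format cannot be determined.

    The return values are assumed for illustration only.
    """
    # Look only at the URL path so query strings and fragments are ignored.
    path = urlparse(url).path.lower()
    if path.endswith(".zip"):
        return "zip"
    if path.endswith((".tar.gz", ".tgz")):
        return "tar.gz"
    # Fall back to the Content-Type header, dropping parameters such as charset.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in _TARBALL_CONTENT_TYPES:
        return "tar.gz"
    return ""
```

Returning a falsy value for unknown formats lets callers decide whether to try another hint or reject the download.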

Both helpers are public (no underscore prefix) and imported directly by presets.py and __init__.py.

Extensions & presets

install_from_zip() on both managers now detects archive format from the file extension and dispatches to ZIP or tarball extraction accordingly — existing callers are unaffected
download_extension() / download_pack() capture response.geturl() as the canonical post-redirect URL and detect archive format from the final URL first (with Content-Type fallback), only consulting the original download_url as a last hint. This ensures redirects that change the archive extension or type are honored when choosing the on-disk suffix and dispatching to the correct extractor. Unknown formats are rejected with a clear error rather than silently defaulting to ZIP, and the final URL's scheme is re-validated to guard against scheme-downgrade via redirects.

`__init__.py` call sites

| Path | Change |
| --- | --- |
| `extension add --from` | Detects format from the post-redirect `final_url` first (Content-Type fallback), then the original URL as a last hint; unknown formats are rejected; HTTPS is re-checked on the post-redirect URL |
| `preset add --from` | Same as `extension add --from` |
| `extension update` | Inline manifest peek handles both ZIP and tar.gz; cache filename is sanitized via `Path(extension).name` to prevent path traversal |
| `workflow add` (URL) | Extracts `workflow.yml` from the archive when the URL points to one; temp-file paths are initialized before the write to avoid `UnboundLocalError` on disk-full |
| `workflow add` (local) | Accepts local `.tar.gz`/`.tgz`/`.zip` archive files (case-insensitive detection) |
| `workflow add` (catalog) | Same archive detection for catalog-sourced URLs |

A shared _extract_workflow_yml(archive_path, fmt) helper handles root-level and single-nested-directory layouts for both formats, with tarfile.extractfile() handles properly closed via context managers.
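
A rough sketch of how such a helper might locate `workflow.yml` in either layout is shown below. The function names, the `"zip"`/`"tar.gz"` format strings, the `ValueError` error type, and the decision to return raw bytes are all assumptions made for illustration; the PR's real helper may differ.

```python
import tarfile
import zipfile
from pathlib import Path
from typing import Optional


def pick_workflow_member(names: list[str]) -> Optional[str]:
    """Choose workflow.yml at the root, else inside a single top-level directory."""
    if "workflow.yml" in names:
        return "workflow.yml"
    nested = [n for n in names if n.count("/") == 1 and n.endswith("/workflow.yml")]
    return nested[0] if len(nested) == 1 else None


def extract_workflow_yml(archive_path: Path, fmt: str) -> bytes:
    """Return the bytes of workflow.yml from a ZIP or gzip tarball (hypothetical)."""
    if fmt == "zip":
        with zipfile.ZipFile(archive_path) as zf:
            member = pick_workflow_member(zf.namelist())
            if member is None:
                raise ValueError("workflow.yml not found in archive")
            return zf.read(member)
    if fmt == "tar.gz":
        with tarfile.open(archive_path, "r:gz") as tf:
            names = [m.name for m in tf.getmembers() if m.isfile()]
            member = pick_workflow_member(names)
            if member is None:
                raise ValueError("workflow.yml not found in archive")
            fileobj = tf.extractfile(member)
            # extractfile() returns None for members that are not regular files.
            if fileobj is None:
                raise ValueError("workflow.yml is not a regular file")
            with fileobj:  # close the handle deterministically
                return fileobj.read()
    raise ValueError(f"Unsupported archive format: {fmt!r}")
```

Reading the single file in memory avoids extracting the whole archive to disk, so no path-traversal check is needed for this particular lookup.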

Tests

30 new tests across test_extensions.py, test_presets.py, and test_workflows.py covering:

test_extensions.py: format detection (URL + Content-Type), flat and nested tarball install, missing manifest errors, path traversal rejection, symlink rejection
test_presets.py: flat and nested tarball install, missing manifest errors, path traversal rejection, symlink rejection
test_workflows.py (TestWorkflowAddArchive, 9 CLI-level tests): local ZIP (flat/nested), local .tar.gz (flat/nested), .tgz alias, missing workflow.yml error cases, URL-based archive download for both ZIP and tar.gz formats

Copilot

Copilot's findings

Files reviewed: 6/6 changed files
Comments generated: 1

+            # On older versions pass only the pre-validated members so that no
+            # unvetted entry (added concurrently or via a race) slips through.
+            if sys.version_info >= (3, 12):
+                tf.extractall(dest_dir, filter="data")  # type: ignore[call-arg]


mnriem · 2026-05-11T15:26:34Z

@copilot apply changes based on the comments in this thread

Copilot · 2026-05-11T15:29:15Z

Fixed in commit — the Python 3.12+ branch now calls tf.extractall(dest_dir, members=safe_members, filter="data"), so the pre-validated member list is respected on all Python versions and metadata-only/rejected entries cannot slip through on 3.12+.
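
To make the fix concrete, here is a simplified sketch of the validate-then-extract pattern with `members=safe_members` passed on every Python version. It is not the PR's implementation: the error messages, the default `error_class=ValueError`, and the exact validation order are assumptions.

```python
import sys
import tarfile
from pathlib import PurePosixPath

# Metadata-only entry types that carry no extractable payload.
_SKIPPED_TYPES = {
    tarfile.XHDTYPE,
    tarfile.XGLTYPE,
    tarfile.SOLARIS_XHDTYPE,
    tarfile.GNUTYPE_LONGNAME,
    tarfile.GNUTYPE_LONGLINK,
}


def safe_extract_tarball(archive_path, dest_dir, error_class=ValueError):
    """Validate every member first, then extract only the vetted ones."""
    try:
        with tarfile.open(archive_path, "r:gz") as tf:
            safe_members = []
            for member in tf.getmembers():
                if member.type in _SKIPPED_TYPES:
                    continue
                parts = PurePosixPath(member.name).parts
                if member.name.startswith("/") or ".." in parts:
                    raise error_class(f"Unsafe path in tarball: {member.name}")
                # Rejects symlinks, hard links, devices, and FIFOs.
                if not (member.isreg() or member.isdir()):
                    raise error_class(f"Unsupported entry in tarball: {member.name}")
                safe_members.append(member)
            if sys.version_info >= (3, 12):
                # The vetted list is passed here too, alongside the data filter.
                tf.extractall(dest_dir, members=safe_members, filter="data")
            else:
                tf.extractall(dest_dir, members=safe_members)
    except (tarfile.TarError, OSError) as exc:
        raise error_class(f"Failed to extract tarball: {exc}") from exc
```

Passing `members=` together with `filter="data"` means the filter acts as a second layer rather than a replacement for the pre-validation pass.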

Copilot

Copilot's findings

Files reviewed: 6/6 changed files
Comments generated: 4

+        # Detect archive format from URL; resolve via Content-Type when needed.
+        # `final_url` may differ from `download_url` if the server redirects.
+        archive_fmt = detect_archive_format(download_url)
+
+        # Download the archive
        try:
            with self._open_url(download_url, timeout=60) as response:
-                zip_data = response.read()
-
-            zip_path.write_bytes(zip_data)
-            return zip_path
+                final_url = response.geturl()
+                if not archive_fmt:
+                    content_type = response.headers.get("Content-Type", "")
+                    archive_fmt = detect_archive_format(final_url, content_type)
+                archive_data = response.read()


+        # Detect archive format from URL; resolve via Content-Type when needed.
+        # `final_url` may differ from `download_url` if the server redirects.
+        archive_fmt = detect_archive_format(download_url)

        try:
            with self._open_url(download_url, timeout=60) as response:
-                zip_data = response.read()
-
-            zip_path.write_bytes(zip_data)
-            return zip_path
+                final_url = response.geturl()
+                if not archive_fmt:
+                    content_type = response.headers.get("Content-Type", "")
+                    archive_fmt = detect_archive_format(final_url, content_type)
+                archive_data = response.read()


            with tempfile.TemporaryDirectory() as tmpdir:
-                zip_path = Path(tmpdir) / "preset.zip"
+                archive_fmt = _det_fmt(from_url)
+                final_url = from_url
                try:
                    with urllib.request.urlopen(from_url, timeout=60) as response:
-                        zip_path.write_bytes(response.read())
+                        final_url = response.geturl()
+                        if not archive_fmt:
+                            content_type = response.headers.get("Content-Type", "")
+                            archive_fmt = _det_fmt(final_url, content_type)
+                        archive_data = response.read()


+                archive_fmt = detect_archive_format(from_url)
+                archive_path = None

                try:
                    with urllib.request.urlopen(from_url, timeout=60) as response:
-                        zip_data = response.read()
-                    zip_path.write_bytes(zip_data)
+                        final_url = response.geturl()
+                        if not archive_fmt:
+                            content_type = response.headers.get("Content-Type", "")
+                            archive_fmt = detect_archive_format(final_url, content_type)
+                        archive_data = response.read()


mnriem · 2026-05-12T13:17:43Z

@copilot apply changes based on the comments in this thread

Copilot · 2026-05-12T13:20:20Z

Applied the change in all 4 URL-download paths (extensions.py download_extension(), presets.py download_pack(), __init__.py preset add --from, __init__.py extension add --from):

Format detection now runs against the post-redirect final_url first (with Content-Type as fallback inside the same call) and only falls back to the original URL as a last hint when the final URL gives no signal. This way a redirect that changes the archive extension or type is honored when choosing the on-disk suffix and dispatching to the correct extractor.
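
The resulting detection order can be illustrated with a small self-contained sketch. Both functions here are stand-ins: `detect_archive_format` is a simplified version with assumed return values, and `resolve_archive_format` is a hypothetical wrapper that does not exist in the PR; it only gathers the three steps in one place.

```python
def detect_archive_format(url: str, content_type: str = "") -> str:
    """Simplified stand-in for the PR's helper (return values are assumed)."""
    path = url.split("?", 1)[0].split("#", 1)[0].lower()
    if path.endswith(".zip"):
        return "zip"
    if path.endswith((".tar.gz", ".tgz")):
        return "tar.gz"
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in {"application/gzip", "application/x-gzip", "application/x-tar+gzip"}:
        return "tar.gz"
    return ""


def resolve_archive_format(original_url: str, final_url: str, content_type: str) -> str:
    """Apply the detection order described above and reject unknown formats."""
    # 1. Post-redirect URL, with Content-Type as the fallback in the same call.
    fmt = detect_archive_format(final_url, content_type)
    # 2. Original URL only as a last hint.
    if not fmt:
        fmt = detect_archive_format(original_url)
    # 3. No silent default to ZIP.
    if not fmt:
        raise ValueError(f"Unknown archive format for {final_url}")
    return fmt
```

For example, a request for `pkg.zip` that redirects to `pkg.tar.gz` resolves to the tarball format, so the on-disk suffix and the extractor both follow the file that was actually served.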

+        # post-redirect URL first (with Content-Type fallback); only use the
+        # original `download_url` as a last hint if the final URL gives no
+        # signal.
+        final_url = download_url


+        # Determine the archive format from the post-redirect URL first
+        # (with Content-Type fallback); only use the original `download_url`
+        # as a last hint if the final URL gives no signal.
+        final_url = download_url


Copilot

Copilot's findings

Comments suppressed due to low confidence (1)

src/specify_cli/__init__.py:3679

final_url is checked for HTTPS only after archive_data = response.read(). This means an insecure redirect can still be fetched before you abort. To make the scheme-downgrade guard effective, validate the scheme right after response.geturl() and before reading the response body (or disallow non-HTTPS redirects in the opener).

                    with urllib.request.urlopen(from_url, timeout=60) as response:
                        final_url = response.geturl()
                        content_type = response.headers.get("Content-Type", "")
                        archive_fmt = detect_archive_format(final_url, content_type)
                        if not archive_fmt:
                            archive_fmt = detect_archive_format(from_url)
                        archive_data = response.read()

Files reviewed: 6/6 changed files
Comments generated: 3

            with self._open_url(download_url, timeout=60) as response:
-                zip_data = response.read()
-
-            zip_path.write_bytes(zip_data)
-            return zip_path
+                final_url = response.geturl()
+                content_type = response.headers.get("Content-Type", "")
+                archive_fmt = detect_archive_format(final_url, content_type)
+                if not archive_fmt:
+                    archive_fmt = detect_archive_format(download_url)
+                archive_data = response.read()


            with self._open_url(download_url, timeout=60) as response:
-                zip_data = response.read()
-
-            zip_path.write_bytes(zip_data)
-            return zip_path
+                final_url = response.geturl()
+                content_type = response.headers.get("Content-Type", "")
+                archive_fmt = detect_archive_format(final_url, content_type)
+                if not archive_fmt:
+                    archive_fmt = detect_archive_format(download_url)
+                archive_data = response.read()



                    with urllib.request.urlopen(from_url, timeout=60) as response:
-                        zip_path.write_bytes(response.read())
+                        final_url = response.geturl()
+                        content_type = response.headers.get("Content-Type", "")
+                        # Prefer the post-redirect URL for format detection;
+                        # fall back to the original URL only as a last hint.
+                        archive_fmt = _det_fmt(final_url, content_type)
+                        if not archive_fmt:
+                            archive_fmt = _det_fmt(from_url)
+                        archive_data = response.read()


mnriem · 2026-05-12T16:20:54Z

@copilot apply changes based on the comments in this thread

Copilot · 2026-05-12T16:23:39Z

Applied in 5d75366: in all four URL-download paths (extensions.py download_extension(), presets.py download_pack(), __init__.py preset add --from, __init__.py extension add --from), the post-redirect scheme re-validation now runs immediately after response.geturl() and before response.read(). A scheme-downgraded redirect now aborts before any body is fetched.
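
The ordering matters more than the check itself, so a minimal sketch may help. The helper names `ensure_https` and `read_archive` are hypothetical, as is the strict HTTPS-only policy and the `ValueError` error type; the PR's actual validation rules are not shown in this thread.

```python
from urllib.parse import urlparse


def ensure_https(final_url: str) -> None:
    """Raise if the post-redirect URL is not HTTPS (assumed policy)."""
    scheme = urlparse(final_url).scheme.lower()
    if scheme != "https":
        raise ValueError(f"Refusing non-HTTPS redirect target: {final_url}")


def read_archive(response) -> bytes:
    """Check the final URL's scheme before pulling the response body."""
    # geturl() reports the URL after any redirects have been followed.
    ensure_https(response.geturl())
    # Only reached when the scheme check passed.
    return response.read()
```

With this ordering, a response whose final URL was downgraded to `http://` raises before `read()` is ever called.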

Copilot

Copilot's findings

Files reviewed: 6/6 changed files
Comments generated: 4

+                archive_fmt = detect_archive_format(final_url)
+                if not archive_fmt:
+                    content_type = resp.headers.get("Content-Type", "")
+                    archive_fmt = detect_archive_format(final_url, content_type)


+            # Detect archive format from the final URL or Content-Type header.
+            cat_archive_fmt = detect_archive_format(final_url)
+            if not cat_archive_fmt:
+                cat_ct = response.headers.get("Content-Type", "")
+                cat_archive_fmt = detect_archive_format(final_url, cat_ct)


+                # Extract ZIP safely (prevent Zip Slip attack)
+                with zipfile.ZipFile(zip_path, 'r') as zf:
+                    # Validate all paths first before extracting anything
+                    temp_path_resolved = temp_path.resolve()
+                    for member in zf.namelist():
+                        member_path = (temp_path / member).resolve()
+                        # Use is_relative_to for safe path containment check
+                        try:
+                            member_path.relative_to(temp_path_resolved)
+                        except ValueError:
+                            raise ValidationError(
+                                f"Unsafe path in ZIP archive: {member} (potential path traversal)"
+                            )
+                    # Only extract after all paths are validated
+                    zf.extractall(temp_path)


+                with zipfile.ZipFile(zip_path, 'r') as zf:
+                    temp_path_resolved = temp_path.resolve()
+                    for member in zf.namelist():
+                        member_path = (temp_path / member).resolve()
+                        try:
+                            member_path.relative_to(temp_path_resolved)
+                        except ValueError:
+                            raise PresetValidationError(
+                                f"Unsafe path in ZIP archive: {member} "
+                                "(potential path traversal)"
+                            )
+                    zf.extractall(temp_path)


Initial plan

1bda2f0

Copilot AI assigned Copilot and mnriem Apr 28, 2026

Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 17:49

Copilot started work on behalf of mnriem April 28, 2026 17:49

Copilot AI linked an issue Apr 28, 2026 that may be closed by this pull request

[Enhancement] Support .tar.gz / .tgz archives for extension, preset, and workflow installation #2377

Open

Add .tar.gz/.tgz archive support for extension, preset, and workflow installation

a434e5a

Agent-Logs-Url: https://github.com/github/spec-kit/sessions/9fb9a8ea-0967-4baf-b95c-7101e423ff58 Co-authored-by: mnriem <15701806+mnriem@users.noreply.github.com>

Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:04

Address code review: fix import style and rename local aliases

b37f117

Agent-Logs-Url: https://github.com/github/spec-kit/sessions/9fb9a8ea-0967-4baf-b95c-7101e423ff58 Co-authored-by: mnriem <15701806+mnriem@users.noreply.github.com>

Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:06

Improve tarball extraction security and cleanup logic

b3a60f5

Agent-Logs-Url: https://github.com/github/spec-kit/sessions/9fb9a8ea-0967-4baf-b95c-7101e423ff58 Co-authored-by: mnriem <15701806+mnriem@users.noreply.github.com>

Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:09

Remove unnecessary import aliases, use consistent names

d78ead1

Agent-Logs-Url: https://github.com/github/spec-kit/sessions/9fb9a8ea-0967-4baf-b95c-7101e423ff58 Co-authored-by: mnriem <15701806+mnriem@users.noreply.github.com>

Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:12

Fix type hint, add null checks for tf.extractfile() return value

0c6cc45

Agent-Logs-Url: https://github.com/github/spec-kit/sessions/9fb9a8ea-0967-4baf-b95c-7101e423ff58 Co-authored-by: mnriem <15701806+mnriem@users.noreply.github.com>

Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:14

Copilot AI changed the title ~~[WIP] Add support for .tar.gz and .tgz archives~~ Add .tar.gz / .tgz archive support for extension, preset, and workflow installation Apr 28, 2026

Copilot finished work on behalf of mnriem April 28, 2026 18:16

Copilot AI requested a review from mnriem April 28, 2026 18:16

github-code-quality Bot found potential problems Apr 28, 2026

Comment thread src/specify_cli/__init__.py Fixed

mnriem marked this pull request as ready for review May 5, 2026 22:58

mnriem requested review from Copilot and removed request for Copilot May 5, 2026 22:58

github-code-quality Bot found potential problems May 7, 2026

Comment thread src/specify_cli/__init__.py Fixed

Comment thread src/specify_cli/extensions.py Fixed

Comment thread src/specify_cli/presets.py Fixed

Potential fix for pull request finding 'Variable defined multiple times'

7ff9c8b

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 8, 2026 19:11

mnriem and others added 2 commits May 8, 2026 14:11

Potential fix for pull request finding 'Variable defined multiple times'

eec1291

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Potential fix for pull request finding 'Variable defined multiple times'

0825f50

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Copilot started reviewing on behalf of mnriem May 8, 2026 19:12

Copilot AI reviewed May 8, 2026

Comment thread src/specify_cli/extensions.py Outdated

Copilot started work on behalf of mnriem May 11, 2026 15:26

Fix safe_extract_tarball: pass safe_members to extractall on Python 3.12+

a8320d9

Copilot finished work on behalf of mnriem May 11, 2026 15:30

mnriem requested a review from Copilot May 11, 2026 20:43

Copilot started reviewing on behalf of mnriem May 11, 2026 20:44

Copilot AI reviewed May 11, 2026

Copilot started work on behalf of mnriem May 12, 2026 13:17

Prefer final_url over original URL for archive format detection in download paths

a69d427

Copilot finished work on behalf of mnriem May 12, 2026 13:21

github-code-quality Bot found potential problems May 12, 2026

Potential fix for pull request finding 'Variable defined multiple times'

7344071

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 12, 2026 14:57

Copilot AI reviewed May 12, 2026

Copilot started work on behalf of mnriem May 12, 2026 16:21

Validate post-redirect URL scheme before reading response body

5d75366

Copilot finished work on behalf of mnriem May 12, 2026 16:24

Copilot AI reviewed May 12, 2026

Copilot wants to merge 23 commits into main from copilot/add-tar-gz-support

3 participants
